# MVGE: Scale-invariant and Temporal-consistent Monocular Video Geometry Estimation

**Paper ID**: 12646 (Submitted to ICLR 2026)

This repository contains the supplementary materials for **MVGE**: a novel approach for estimating **scale-invariant** and **temporally consistent** 3D geometry from extended monocular video sequences, capable of processing **hundreds of frames** while maintaining both **geometric accuracy** and **temporal stability**.

## Project Structure

    ├── img/
    │   ├── logo.png                    # Project logo
    │   └── method.png                  # MVGE framework visualization
    ├── static/
    │   ├── css/                        # Stylesheets for project page
    │   └── js/                         # JavaScript for interactivity
    ├── video/
    │   ├── v/                          # MVGE performance demonstrations
    │   ├── compare/
    │   │   ├── vggt/                   # Comparison with VGGT
    │   │   └── moge/                   # Comparison with MoGe baseline
    │   ├── othersize/                  # Portrait format video processing
    │   ├── point/                      # 4D scene reconstruction videos
    │   └── longvideo/                  # Extended sequence processing
    ├── index.html                      # Main project page
    └── README.md                       # This documentation

## Project Page Overview

The project page (accessible via `index.html`) demonstrates:

### 1. Model Demonstrations
The top section showcases MVGE's results on diverse open-world videos, highlighting its ability to produce **temporally consistent and scale-invariant 3D geometry** from monocular video over extended sequences.

### 2. Framework Architecture  
MVGE uses a ViT backbone to process the input video frames, followed by a temporal decoder with cross-attention and RoPE with dynamic NTK scaling. The framework enforces cross-frame geometric consistency at both global and local levels, applies hierarchical temporal supervision at multiple temporal strides (δ = 1, 2, 4, 8), and uses frequency-modulated positional encoding with train-time sequence stretching so that it extrapolates to sequences longer than those seen in training.
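One way to read "hierarchical temporal supervision at multiple temporal strides" is that a consistency term is evaluated on frame pairs `(t, t + δ)` for each stride δ. The sketch below only enumerates those pairs. The function name `stride_pairs` and the 24-frame clip length are illustrative assumptions, and the loss applied to each pair is not specified in this README.

```python
def stride_pairs(num_frames, strides=(1, 2, 4, 8)):
    """Return {stride: [(t, t + stride), ...]} for every valid frame pair.

    Hypothetical helper for illustration; not taken from the MVGE code.
    """
    pairs = {}
    for delta in strides:
        # A pair (t, t + delta) is valid only while t + delta stays inside the clip.
        pairs[delta] = [(t, t + delta) for t in range(num_frames - delta)]
    return pairs


# Assumed clip length of 24 frames.
pairs = stride_pairs(num_frames=24)
for delta, frame_pairs in pairs.items():
    print(f"stride {delta}: {len(frame_pairs)} pairs, first {frame_pairs[0]}")
```

For a 24-frame clip this yields 23, 22, 20, and 16 pairs for δ = 1, 2, 4, and 8, so larger strides contribute fewer but longer-range comparisons.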

### 3. Comparative Analysis
Side-by-side qualitative comparisons against state-of-the-art methods:
- **vs. VGGT**: Superior temporal stability, enhanced geometric detail preservation, and improved robustness on reflective surfaces
- **vs. MoGe**: Significantly better temporal consistency and robustness to lighting variations across video sequences

### 4. Portrait Video Processing
Demonstrates MVGE's adaptability to portrait-format videos, maintaining geometric precision and temporal consistency across diverse scene types while seamlessly handling vertical aspect ratios.

### 5. 4D Scene Reconstruction  
Temporal 4D reconstruction that combines MVGE point maps with camera poses estimated by MegaSAM. The results show coherent scene geometry using only MegaSAM's first stage (pose estimation), without its second-stage depth optimization.
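Combining per-frame point maps with camera poses amounts to moving each frame's camera-space points into a shared world frame. The following is a minimal sketch under stated assumptions: poses are 4x4 camera-to-world matrices, and a per-frame `scale` factor (hypothetical) brings the scale-invariant points into the same units as the pose translations. The actual pose format and alignment procedure used for the demos are not described here.

```python
def camera_to_world(points_cam, pose_c2w, scale=1.0):
    """Map camera-space 3D points to world space.

    points_cam: iterable of (x, y, z) tuples in the camera frame.
    pose_c2w:   4x4 camera-to-world matrix as row-major nested lists (assumed convention).
    scale:      assumed factor aligning the point map with the pose translation units.
    """
    world = []
    for x, y, z in points_cam:
        xs, ys, zs = scale * x, scale * y, scale * z
        # Rotate with the upper-left 3x3 block, then add the translation column.
        world.append(tuple(
            row[0] * xs + row[1] * ys + row[2] * zs + row[3]
            for row in pose_c2w[:3]
        ))
    return world


# Hypothetical pose: 90-degree rotation about the y-axis plus a 1-unit shift along world x.
pose = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [-1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
points = [(0.0, 0.0, 2.0), (1.0, 0.0, 0.0)]
world_points = camera_to_world(points, pose)
print(world_points)
```

Applying this to every frame with its own pose and concatenating the results gives a time-indexed point cloud in a single coordinate system.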

### 6. Extended Sequence Processing
Long-range temporal inference on sequences exceeding 300 frames, using overlapping 256-frame sliding windows. This approach is similar to Video Depth Anything and DepthCrafter but uses larger windows; it effectively prevents long-term scale drift and, in principle, extends to sequences of arbitrary length.
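The windowing itself can be sketched as follows. The 256-frame window comes from the description above; the 32-frame overlap, the function name `sliding_windows`, and the choice to shift the last window back so it stays full-length are assumptions made for illustration. How predictions in overlapping regions are merged is not covered by this sketch.

```python
def sliding_windows(num_frames, window=256, overlap=32):
    """Return (start, end) frame-index pairs, end exclusive, covering every frame."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if num_frames <= window:
        # Short sequences fit in a single window.
        return [(0, num_frames)]
    step = window - overlap
    windows = []
    start = 0
    while start + window < num_frames:
        windows.append((start, start + window))
        start += step
    # Shift the final window back so it is full-length and ends on the last frame.
    windows.append((num_frames - window, num_frames))
    return windows


print(sliding_windows(300))
print(len(sliding_windows(1000)))
```

With these assumed settings a 300-frame video is covered by two windows, `(0, 256)` and `(44, 300)`, and a 1000-frame video by five. Consecutive windows always share at least `overlap` frames, which is what lets later windows be aligned to earlier ones.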


## Key Technical Innovations

- **Viewpoint-Invariant Geometry**: Cross-frame geometric constraints that enforce multi-scale consistency by using camera poses to transform points from multiple viewpoints into randomly selected reference frames

- **Appearance-Invariant Learning**: Hierarchical temporal supervision across exponentially increasing time intervals with frame-specific augmentations that decouple geometric structure from transient visual conditions

- **Adaptive Frequency-Modulated Positioning**: NTK-guided RoPE with dynamic scaling and training-time sequence stretching (applied with 50% probability), enabling extrapolation to sequences far longer than the 24-frame training clips
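To make the positional-encoding item more concrete, here is a sketch of one widely used formulation of dynamic NTK scaling for RoPE, in which the rotary base grows with the ratio of the inference length to the training length. It is not confirmed to be the exact variant used in MVGE. The 24-frame training length comes from the list above, while `base=10000.0`, `head_dim=64`, and `factor=1.0` are assumed values, and training-time sequence stretching is not modeled.

```python
import math


def dynamic_ntk_base(seq_len, train_len=24, base=10000.0, head_dim=64, factor=1.0):
    """Rescale the RoPE base when seq_len exceeds the training length (assumed formulation)."""
    if seq_len <= train_len:
        return base
    scale = factor * seq_len / train_len - (factor - 1.0)
    return base * scale ** (head_dim / (head_dim - 2))


def rope_inv_freq(base, head_dim=64):
    """Inverse frequencies base^(-2i/d) for each rotary pair i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


short = rope_inv_freq(dynamic_ntk_base(24))    # training-length sequence: base unchanged
long = rope_inv_freq(dynamic_ntk_base(256))    # 256-frame window: base enlarged

# The highest frequency is untouched; the lowest is slowed so that position 256
# reaches the same rotation angle that position 24 reached during training.
print(short[0], long[0])
print(math.isclose(256 * long[-1], 24 * short[-1]))
```

The design intent of this family of methods is that fine-grained (high-frequency) position information is preserved while the low-frequency components are compressed into the range seen during training.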

## Performance Highlights

- **24.2%** reduction in relative point map error on ScanNet
- **34.9%** improvement in temporal alignment error  
- **39.1 FPS** processing speed for 300-frame sequences at 378×672 resolution

---

*Code and trained models will be publicly released upon paper acceptance.*